1 Climate change and temperature anomalies

If we wanted to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here

To define temperature anomalies you need to have a reference, or base, period which NASA clearly states that it is the period between 1951-1980.

Run the code below to load the file:

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, 
           na = "***")

You have two objectives in this section:

  1. Select the year and the twelve month variables from the weather dataset. We do not need the others (J-D, D-N, DJF, etc.) for this assignment. Hint: use select() function.

  2. Convert the dataframe from wide to ‘long’ format. Hint: use gather() or pivot_longer() function. Name the new dataframe as tidyweather, name the variable containing the name of the month as month, and the temperature deviation values as delta.

tidyweather <- weather %>% 
  select (Year, Jan, Feb, Mar, Apr, May, 
          Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>%
  pivot_longer(cols = 2:13, #columns 3 to 5
               names_to = "Month",
               values_to = "delta")

Inspect your dataframe. It should have three variables now, one each for

  1. year,
  2. month, and
  3. delta, or temperature deviation.

1.1 Plotting Information

Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plot chronologically.

glimpse(tidyweather)
## Rows: 1,704
## Columns: 3
## $ Year  <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
## $ Month <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "…
## $ delta <dbl> -0.35, -0.50, -0.23, -0.29, -0.05, -0.15, -0.18, -0.25, -0.23, -…
tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), Month, "1")),
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point(size = 0.8)+
  geom_smooth(color="red") +
  theme_bw() +
  labs (
    title = "Weather Anomalies"
  )

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a seperate scatter plot for each month, again with a smoothing line. Your chart should human-readable labels; that is, each month should be labeled “Jan”, “Feb”, “Mar” (full or abbreviated month names are fine), not 1, 2, 3.

We remove data before 1800 and before using filter. Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(Year>= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
  ))

glimpse(comparison)
## Rows: 1,692
## Columns: 7
## $ Year     <dbl> 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1…
## $ Month    <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"…
## $ delta    <dbl> -0.31, -0.22, -0.04, 0.00, 0.04, -0.32, 0.08, -0.04, -0.26, -…
## $ date     <date> 1881-01-01, 1881-02-01, 1881-03-01, 1881-04-01, 1881-05-01, …
## $ month    <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J…
## $ year     <dbl> 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1881, 1…
## $ interval <chr> "1881-1920", "1881-1920", "1881-1920", "1881-1920", "1881-192…

Inspect the comparison dataframe by clicking on it in the Environment pane.

Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.

ggplot(comparison, aes(x=delta, fill=interval))+
  geom_density(alpha=0.2) +   #density plot with tranparency set to 20%
  theme_bw() +                #theme
  labs (
    title = "Density Plot for Monthly Temperature Anomalies",
    y     = "Density"         #changing y-axis label to sentence case
  )+
  NULL

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  
  # creating summaries for mean delta 
  # use `na.rm=TRUE` to eliminate NA (not available) values 
  summarise(annual_average_delta = mean(delta, na.rm=TRUE)) 

glimpse(average_annual_anomaly)
## Rows: 142
## Columns: 2
## $ Year                 <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1…
## $ annual_average_delta <dbl> -0.2808, -0.1733, -0.2083, -0.2742, -0.4217, -0.4…
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth() +
  
  #change to theme_bw() to have white background + black frame around plot
  theme_bw() +
  labs (
    title = "Average Yearly Anomaly",
    y     = "Average Annual Delta"
  )                         

1.2 Confidence Interval for delta

NASA points out on their website that

A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.

Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.

formula_ci <- comparison %>% 
  group_by(interval) %>% 
  summarise(mean_delta = mean(delta, na.rm=TRUE),
            median_delta = median(delta, na.rm=TRUE),
            sd_delta = sd(delta, na.rm=TRUE),
            count_delta = n(),
            se_delta = sd_delta / sqrt(count_delta),
            ci_delta_up = mean_delta + qt(.975, count_delta-1)*se_delta ,
            ci_delta_dw = mean_delta - qt(.975, count_delta-1)*se_delta 
            )
            
  # choose the interval 2011-present
  # what dplyr verb will you use? 

  # calculate summary statistics for temperature deviation (delta) 
  # calculate mean, SD, count, SE, lower/upper 95% CI
  # what dplyr verb will you use? 

#print out formula_CI
print(formula_ci)
## # A tibble: 5 × 8
##   interval     mean_delta median_delta sd_delta count_delta se_delta ci_delta_up
##   <chr>             <dbl>        <dbl>    <dbl>       <int>    <dbl>       <dbl>
## 1 1881-1920    -0.338           -0.33     0.236         480   0.0108    -0.317  
## 2 1921-1950    -0.0186          -0.025    0.233         360   0.0123     0.00546
## 3 1951-1980    -0.0000833        0.01     0.197         360   0.0104     0.0204 
## 4 1981-2010     0.468            0.475    0.315         360   0.0166     0.500  
## 5 2011-present  1.06             1.04     0.276         132   0.0240     1.11   
## # … with 1 more variable: ci_delta_dw <dbl>
# use the infer package to construct a 95% CI for delta

set.seed(1)

#Calculate the CI for the relevant interval...
#2011 - present

boot_weather_1 <- comparison %>%
  filter(interval == "2011-present") %>% 
  specify(response = delta) %>% 
  generate(reps=100, type="bootstrap") %>% 
  calculate(stat = "mean")

percentile_ci_1<- boot_weather_1 %>% 
  get_confidence_interval(level = 0.95, type = "percentile")
print(percentile_ci_1)
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.02     1.11

What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!

  • The data clearly shows us what has consistently been advocated by climate scientists: that our earth is significantly warming, and that those changes really started taking off after the industrial revolution. We used scatter plots, density graphs and confidence intervals to dissect and present the NASA climate data. Going in to the data, especially when grouped by month and viewing the variance in the climate delta observed, we can see a clear trend towards hotter summers, which we have all experienced ourselves. The greater variance in the non-summer months is in line with the claims by climate change deniers that an abnormally cold winter refutes climate change. The density plot underlines that the changes to our global climate have been especially drastic in the last two decades, hinting at the possibility that we are now on an exponential growth trajectory in our worldwide temperatures.

2 General Social Survey (GSS)

The General Social Survey (GSS) gathers data on American society in order to monitor and explain trends in attitudes, behaviours, and attributes. Many trends have been tracked for decades, so one can see the evolution of attitudes, etc in American Society.

gss <- read_csv(here::here("data", "smallgss2016.csv"), 
                na = c("", "Don't know",
                       "No answer", "Not applicable"))

We will be creating 95% confidence intervals for population parameters. The variables we have are the following:

  • hours and minutes spent on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.50 hours, this would be recorded as emailhr = 2 and emailmin = 30.
  • snapchat, instagrm, twitter: whether respondents used these social media in 2016
  • sex: Female - Male
  • degree: highest education level attained

2.1 Instagram and Snapchat, by sex

Can we estimate the population proportion of Snapchat or Instagram users in 2016?

  1. Create a new variable, snap_insta that is Yes if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and No if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.
  2. Calculate the proportion of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.
  3. Using the CI formula for proportions, please construct 95% CIs for men and women who used either Snapchat or Instagram
temp_gss <- gss %>% 
   mutate(snap_insta = case_when(
    snapchat == "Yes" | instagrm == "Yes" ~ "Yes",
    snapchat == "NA" | instagrm == "NA" ~ "NA",
    TRUE ~ "No"
  ))
prop_gss <- temp_gss %>% 
  filter(snap_insta != "NA") %>% 
  count(snap_insta) %>% 
  mutate(prop = n/sum(n))
temp_gss %>% 
  filter(snap_insta != "NA") %>% 
  mutate(snap_insta = snap_insta == "Yes") %>% 
  group_by(sex) %>% 
  summarise(mean_prop = mean(snap_insta),
            sd_prop = sd(snap_insta),
            count_prop = n(),
            se_prop = sd_prop / sqrt(count_prop),
            ci_prop_up = mean_prop + qt(.975, count_prop-1)*se_prop ,
            ci_prop_dw = mean_prop - qt(.975, count_prop-1)*se_prop 
            )
## # A tibble: 2 × 7
##   sex    mean_prop sd_prop count_prop se_prop ci_prop_up ci_prop_dw
##   <chr>      <dbl>   <dbl>      <int>   <dbl>      <dbl>      <dbl>
## 1 Female     0.419   0.494        769  0.0178      0.454      0.384
## 2 Male       0.318   0.466        603  0.0190      0.356      0.281
glimpse(temp_gss)
## Rows: 2,867
## Columns: 8
## $ emailmin   <chr> "0", "30", "NA", "10", "NA", "0", "0", "NA", "0", "NA", "0"…
## $ emailhr    <chr> "12", "0", "NA", "0", "NA", "2", "40", "NA", "0", "NA", "2"…
## $ snapchat   <chr> "NA", "No", "No", "NA", "Yes", "No", "NA", "Yes", "NA", "No…
## $ instagrm   <chr> "NA", "No", "No", "NA", "Yes", "Yes", "NA", "Yes", "NA", "N…
## $ twitter    <chr> "NA", "No", "No", "NA", "No", "No", "NA", "No", "NA", "No",…
## $ sex        <chr> "Male", "Male", "Male", "Female", "Female", "Female", "Male…
## $ degree     <chr> "Bachelor", "High school", "Bachelor", "High school", "Grad…
## $ snap_insta <chr> "NA", "No", "No", "NA", "Yes", "Yes", "NA", "Yes", "NA", "N…
glimpse(prop_gss)
## Rows: 2
## Columns: 3
## $ snap_insta <chr> "No", "Yes"
## $ n          <int> 858, 514
## $ prop       <dbl> 0.625, 0.375

2.2 Twitter, by education level

Can we estimate the population proportion of Twitter users by education level in 2016?.

  1. Turn degree from a character variable into a factor variable. Make sure the order is the correct one and that levels are not sorted alphabetically which is what R by default does.
  2. Create a new variable, bachelor_graduate that is Yes if the respondent has either a Bachelor or Graduate degree. As before, if the recorded value for either was NA, the value in your new variable should also be NA.
  3. Calculate the proportion of bachelor_graduate who do (Yes) and who don’t (No) use twitter.
  4. Using the CI formula for proportions, please construct two 95% CIs for bachelor_graduate vs whether they use (Yes) and don’t (No) use twitter.
  5. Do these two Confidence Intervals overlap?
#Question 1 - Change 'degree' from character variable to factor variable. 
#Levels not sorted alphabetically... 

gss2 <- gss
gss2$degree[gss2$degree == "Lt high school"] <- '1'
gss2$degree[gss2$degree == "High school"] <- '2'
gss2$degree[gss2$degree == "Junior college"] <- '3'
gss2$degree[gss2$degree == "Bachelor"] <- '4'
gss2$degree[gss2$degree == "Graduate"] <- '5'

gss2$degree <- as_factor(gss2$degree)

head(gss2,5)
## # A tibble: 5 × 7
##   emailmin emailhr snapchat instagrm twitter sex    degree
##   <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <fct> 
## 1 0        12      NA       NA       NA      Male   4     
## 2 30       0       No       No       No      Male   2     
## 3 NA       NA      No       No       No      Male   4     
## 4 10       0       NA       NA       NA      Female 2     
## 5 NA       NA      Yes      Yes      No      Female 5
#Question 2 - Create New Variable - 'bachelor_graduate' 
gss3 <- gss %>% 
   mutate(bachelor_graduate = case_when(
    degree == 'Bachelor' | degree == 'Graduate' ~ "*Yes*",
    degree == "NA" ~ "NA",
    TRUE ~ "*No*"
  ))

head(gss3,5)
## # A tibble: 5 × 8
##   emailmin emailhr snapchat instagrm twitter sex    degree      bachelor_gradua…
##   <chr>    <chr>   <chr>    <chr>    <chr>   <chr>  <chr>       <chr>           
## 1 0        12      NA       NA       NA      Male   Bachelor    *Yes*           
## 2 30       0       No       No       No      Male   High school *No*            
## 3 NA       NA      No       No       No      Male   Bachelor    *Yes*           
## 4 10       0       NA       NA       NA      Female High school *No*            
## 5 NA       NA      Yes      Yes      No      Female Graduate    *Yes*
#Question 3 - Calculate the proportion of bachelor_graduate that do and do not use twitter...
gss_twitter_prop <- gss3 %>% 
  filter(bachelor_graduate == "*Yes*") %>% 
  filter(twitter != "NA") %>%
  count(twitter) %>% 
  mutate(Proportion = n/sum(n) * 100) 

colnames(gss_twitter_prop) = c("Twitter", "Number", "Proportion [%]")
print(gss_twitter_prop)
## # A tibble: 2 × 3
##   Twitter Number `Proportion [%]`
##   <chr>    <int>            <dbl>
## 1 No         375             76.7
## 2 Yes        114             23.3
#Question 4 - Construct 95% CIs for the proportions of bachelor_graduates that use twitter.

gss3 %>% 
  filter(bachelor_graduate == "*Yes*") %>% 
  filter(twitter != "NA") %>%
  mutate(twitter = twitter == "Yes") %>% 
  summarise(Mean_prop = mean(twitter),
            SD_prop = sd(twitter),
            Count_prop = n(),
            SE_prop = SD_prop / sqrt(Count_prop),
            Upper_CI_prop = Mean_prop + qt(.975, Count_prop-1)*SE_prop ,
            Lower_CI_prop = Mean_prop - qt(.975, Count_prop-1)*SE_prop 
            )
## # A tibble: 1 × 6
##   Mean_prop SD_prop Count_prop SE_prop Upper_CI_prop Lower_CI_prop
##       <dbl>   <dbl>      <int>   <dbl>         <dbl>         <dbl>
## 1     0.233   0.423        489  0.0191         0.271         0.196
gss3 %>% 
  filter(bachelor_graduate == "*Yes*") %>% 
  filter(twitter != "NA") %>%
  mutate(twitter = twitter == "No") %>% 
  summarise(Mean_prop = mean(twitter),
            SD_prop = sd(twitter),
            Count_prop = n(),
            SE_prop = SD_prop / sqrt(Count_prop),
            Upper_CI_prop = Mean_prop + qt(.975, Count_prop-1)*SE_prop ,
            Lower_CI_prop = Mean_prop - qt(.975, Count_prop-1)*SE_prop 
            )
## # A tibble: 1 × 6
##   Mean_prop SD_prop Count_prop SE_prop Upper_CI_prop Lower_CI_prop
##       <dbl>   <dbl>      <int>   <dbl>         <dbl>         <dbl>
## 1     0.767   0.423        489  0.0191         0.804         0.729
  • Question 5: The Confidence Intervals for the proportions of bachelor and graduate students that do and do not use Twitter do NOT overlap. As a result of this, we can say with great confidence that a much larger proportion of bachelor/graduate students do not use twitter (when compared to those that do).

2.3 Email usage

Can we estimate the population parameter on time spent on email weekly?

  1. Create a new variable called email that combines emailhr and emailmin to reports the number of minutes the respondents spend on email weekly.
  2. Visualise the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical amount of time Americans spend on email weekly? Why?
  3. Using the infer package, calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly. Interpret this interval in context of the data, reporting its endpoints in “humanized” units (e.g. instead of 108 minutes, report 1 hr and 8 minutes). If you get a result that seems a bit odd, discuss why you think this might be the case.
  4. Would you expect a 99% confidence interval to be wider or narrower than the interval you calculated above? Explain your reasoning.
#Question 1 - New variable called email...
gss_email <- gss 

gss_email <- transform(gss_email, email = ifelse(gss_email$emailhr != "NA" | gss_email$emailmin != "NA", (as.numeric(gss_email$emailmin) + 60*as.numeric(gss_email$emailhr)), "NA"))

head(gss_email,5)
##   emailmin emailhr snapchat instagrm twitter    sex      degree email
## 1        0      12       NA       NA      NA   Male    Bachelor   720
## 2       30       0       No       No      No   Male High school    30
## 3       NA      NA       No       No      No   Male    Bachelor    NA
## 4       10       0       NA       NA      NA Female High school    10
## 5       NA      NA      Yes      Yes      No Female    Graduate    NA
#Question 2 - Visualize this distribution and find the mean + median amount of time Americans spend on email...

gss_email2 <- gss_email %>% 
  filter(email != "NA")
  gss_email2$email <- as.numeric(gss_email2$email)

  
mean_email <- mean(gss_email2$email)
median_email <- median(gss_email2$email)
print(mean_email)
## [1] 417
print(median_email)
## [1] 120
gss_email2 %>%
  ggplot(aes(x=as.numeric(email)))+ 
  geom_histogram(breaks = seq(min(as.numeric(gss_email2$email)), max(as.numeric(gss_email2$email)), 100)) +
  
  labs(title = "Weekly Email Usage for Americans", x = "Email Minutes", y = "Count (Frequency)") +
  NULL

Question 2 continued… As we can see the distribution is very skewed to the right (positively skewed). Because of the way the arithmetic mean is calculated, large positive outliers (that are clearly present in this distribution) will produce a large, biased mean. In this instance, the median is probably a better “average value”, because it is not distorted by (biased to) the large positive outliers.

#Question 3 - Using the `infer` package, calculate a 95% bootstrap confidence interval for the mean amount of time Americans spend on email weekly.
# Use the infer package to construct a 95% CI for delta

set.seed(1234)

boot_gss <- gss_email2 %>%
  specify(response = email) %>% 
  generate(reps=100, type="bootstrap") %>% 
  calculate(stat = "mean")

email_ci_1 <- boot_gss %>% 
  get_confidence_interval(level = 0.95, type = "percentile")

Hours = email_ci_1 %/% 60
Minutes = email_ci_1 %% 60

email_ci_humanized <- c(Hours[c(1)], Minutes[c(1)], Hours[c(2)], Minutes[c(2)] )
email_ci_humanized <- as.data.frame(email_ci_humanized)

colnames(email_ci_humanized) = c("Lower CI Hours", "Lower CI Minutes", "Upper CI Hours", "Upper CI Minutes")
print(email_ci_humanized)
##   Lower CI Hours Lower CI Minutes Upper CI Hours Upper CI Minutes
## 1              6             28.4              7             28.9
  • This result seems fairly reasonable. The confidence interval perhaps seems fairly wide, given the large number of data points, but this caould be due to the fact that there is large variation in the data, and that large positive outlying values (of which there are a few), will introduce bias and largely affect the mean.

Question 4 - Would you expect a 99% confidence interval to be wider or narrower than the interval you calculated above? Explain your reasoning.

  • We would expect the 99% confidence interval to be wider than the 95% interval. This is because the confidence interval is formulated using the Z/t level, which increases when you look for an interval of greater confidence. For example, the critical Z-value for a confidence interval of 95% is 1.96, but this value would be 2.58 for a 99% confidence interval. Because this critical value is multiplied by the standard error to form the confidence interval, we can expect the interval for 99% confidence to be wider. This is intuitive, because increasing the size of a given region naturally increases our confidence that the mean lies within this specified region/location.

3 Biden’s Approval Margins

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president’s approval

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate           <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate             <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade               <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize          <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population          <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight              <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove          <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve    <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions       <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id         <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate         <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp           <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
# Use `lubridate` to fix dates, as they are given as characters.

3.1 Create a plot

What I would like you to do is to calculate the average net approval rate (approve- disapprove) for each week since he got into office. I want you plot the net approval, along with its 95% confidence interval. There are various dates given for each poll, please use enddate, i.e., the date the poll ended.

Also, please add an orange line at zero. Your plot should look like this:

glimpse(approval_polllist)
## Rows: 1,597
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate           <chr> "1/24/2021", "1/24/2021", "1/25/2021", "1/25/2021"…
## $ enddate             <chr> "1/26/2021", "1/27/2021", "1/27/2021", "1/26/2021"…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Maris…
## $ grade               <chr> "B", "A", "B", "B", "B", "B", "B", "B+", "B", "B",…
## $ samplesize          <dbl> 1500, 1313, 1500, 2200, 15000, 9212, 1500, 906, 15…
## $ population          <chr> "lv", "a", "lv", "a", "a", "a", "lv", "a", "a", "a…
## $ weight              <dbl> 0.3225, 2.1893, 0.3025, 0.1205, 0.2739, 0.1520, 0.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 49.0, 49.0, 58.0, 54.0, 55.0, 50.0, 57.0, 54…
## $ disapprove          <dbl> 48.0, 35.0, 48.0, 32.0, 31.0, 31.5, 45.0, 37.0, 32…
## $ adjusted_approve    <dbl> 50.4, 50.0, 51.4, 56.5, 52.5, 53.5, 52.4, 56.3, 52…
## $ adjusted_disapprove <dbl> 41.9, 34.6, 41.9, 35.4, 34.4, 34.9, 38.9, 36.4, 35…
## $ multiversions       <chr> NA, NA, NA, NA, NA, "*", NA, NA, NA, NA, NA, NA, N…
## $ tracking            <lgl> TRUE, NA, TRUE, NA, TRUE, TRUE, TRUE, NA, TRUE, NA…
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74261, 74320, 74268, 74346, 74277, 74292, 74290, 7…
## $ question_id         <dbl> 139433, 139558, 139483, 139653, 139497, 139518, 13…
## $ createddate         <chr> "1/27/2021", "2/1/2021", "1/28/2021", "2/5/2021", …
## $ timestamp           <chr> "18:35:08 10 Sep 2021", "18:35:08 10 Sep 2021", "1…
#prepare the dataset

#calculate the margin
approval_polllist$enddate <- mdy(approval_polllist$enddate)
new_approval <- approval_polllist %>% 
  filter(subgroup=="Voters") %>% 
  mutate(net_approve = approve - disapprove, year = year(enddate), week = week(enddate))

glimpse(new_approval)
## Rows: 380
## Columns: 25
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "Voters", "Voters", "Voters", "Voters", "Voters", …
## $ modeldate           <chr> "9/10/2021", "9/10/2021", "9/10/2021", "9/10/2021"…
## $ startdate           <chr> "1/24/2021", "1/25/2021", "1/24/2021", "1/26/2021"…
## $ enddate             <date> 2021-01-26, 2021-01-27, 2021-01-27, 2021-01-28, 2…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Rasmu…
## $ grade               <chr> "B", "B", "A", "B", "A+", "B-", "B/C", "B+", "B", …
## $ samplesize          <dbl> 1500, 1500, 1153, 1500, 1002, 1200, 841, 945, 1500…
## $ population          <chr> "lv", "lv", "rv", "lv", "rv", "rv", "lv", "rv", "l…
## $ weight              <dbl> 0.322, 0.303, 2.003, 0.286, 1.945, 0.881, 1.115, 1…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48.0, 49.0, 50.0, 50.0, 58.0, 58.0, 56.1, 61.0, 49…
## $ disapprove          <dbl> 48.0, 48.0, 36.0, 45.0, 29.0, 35.0, 36.4, 39.0, 46…
## $ adjusted_approve    <dbl> 50.5, 51.5, 50.9, 52.5, 56.2, 57.2, 54.8, 56.8, 51…
## $ adjusted_disapprove <dbl> 42.6, 42.6, 35.5, 39.6, 32.7, 36.5, 36.7, 40.4, 40…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, TRUE, NA, NA, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74261, 74268, 74320, 74290, 74321, 74355, 74326, 7…
## $ question_id         <dbl> 139433, 139483, 139559, 139515, 139561, 139680, 13…
## $ createddate         <chr> "1/27/2021", "1/28/2021", "2/1/2021", "1/29/2021",…
## $ timestamp           <chr> "18:35:11 10 Sep 2021", "18:35:11 10 Sep 2021", "1…
## $ net_approve         <dbl> 0.0, 1.0, 14.0, 5.0, 29.0, 23.0, 19.7, 22.0, 3.0, …
## $ year                <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 20…
## $ week                <dbl> 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
#calculate the CI for average approval margin

margin_ci <- new_approval %>% 
  group_by(week) %>% 
  summarise(avg_approval_margin = mean(net_approve),
            sd_margin = sd(net_approve, na.rm=TRUE),
            count_margin = n(),
            se_margin = sd_margin / sqrt(count_margin),
            ci_margin_up = avg_approval_margin + qt(.975, count_margin-1)*se_margin ,
            ci_margin_dw = avg_approval_margin - qt(.975, count_margin-1)*se_margin
            )

glimpse(margin_ci)
## Rows: 33
## Columns: 7
## $ week                <dbl> 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ avg_approval_margin <dbl> 5.00, 13.84, 11.89, 12.85, 11.67, 9.29, 8.25, 11.5…
## $ sd_margin           <dbl> 6.38, 8.95, 9.78, 8.40, 9.49, 7.94, 6.94, 8.14, 10…
## $ count_margin        <int> 4, 14, 9, 13, 13, 14, 12, 11, 15, 10, 8, 15, 14, 1…
## $ se_margin           <dbl> 3.19, 2.39, 3.26, 2.33, 2.63, 2.12, 2.00, 2.45, 2.…
## $ ci_margin_up        <dbl> 15.15, 19.01, 19.40, 17.93, 17.40, 13.87, 12.66, 1…
## $ ci_margin_dw        <dbl> -5.1473, 8.6662, 4.3728, 7.7672, 5.9303, 4.7033, 3…
#plot the for average approval margin
margin_ci %>% 
  ggplot(aes(week)) +
  geom_ribbon(aes(ymin = ci_margin_up, ymax = ci_margin_dw), fill="grey", alpha = 0.5)+
  geom_line(aes(y=avg_approval_margin, group=1), color = "red", size = 0.3) +
  geom_point(aes(y=avg_approval_margin, group=1), color = "red", size  = 0.3) +
  geom_smooth(aes(y=avg_approval_margin)) +
  geom_line(aes(y=ci_margin_up, group=1), color="red", size = 0.3)+
  geom_line(aes(y=ci_margin_dw, group=1), color="red", size = 0.3)+
  
  #add the aesthetics for the graph
  geom_hline(yintercept=0, linetype="solid", color = "orange", size = 1)+
  labs(title="Estimating Approval Margin (approve-disapprove) for Joe Biden",
       subtitle = "Weekly average of all polls",
       x = "Week of the year", 
       y = "Average Approval Margin (Approve - Disapprove") +
  annotate("text", x = 20, y = 20, label = "2021", color = 'black',size = 3)+
  scale_y_continuous(
    labels = scales::number_format(accuracy = 0.1))+
  theme_minimal()+
  NULL

3.2 Compare Confidence Intervals

Compare the confidence intervals for week 3 and week 25. Can you explain what’s going on? One paragraph would be enough.

#calculate the week4 and week5 confidence intervals
margin_ci_comparison <- new_approval %>% 
  filter(week==4 | week==25) %>% 
  group_by(week) %>% 
  summarise(avg_approval_margin = mean(net_approve),
            sd_margin = sd(net_approve, na.rm=TRUE),
            count_margin = n(),
            se_margin = sd_margin / sqrt(count_margin),
            ci_margin_up = avg_approval_margin + qt(.975, count_margin-1)*se_margin ,
            ci_margin_dw = avg_approval_margin - qt(.975, count_margin-1)*se_margin
            )

#plot the confidence intervals
margin_ci_comparison %>% 
  ggplot(aes(x=avg_approval_margin, y=week, color=week))+
  geom_errorbarh(aes(xmin=ci_margin_dw,xmax=ci_margin_up))+
  geom_point(aes(x=avg_approval_margin, y=week), size=6)+
  labs(title="Comparison of confidence intervals for week4 and week 25",
       x = "Average approval margin", 
       y = "Week")

#We can see that the average approval margin for week4 is larger than that in week25. However, the confidence interval in week4 [6.12, 19.4] and week25 [6.36, 11.5] overlap. Therefore, based on the confidence interval graphs generated below, we cannot tell if on average, Biden has significantly lower approval margin in week25 than that in week4.
comparison_approval <- new_approval %>% 
  filter(week==4 | week==25)

t.test(net_approve ~ week, data = comparison_approval)
## 
##  Welch Two Sample t-test
## 
## data:  net_approve by week
## t = -1, df = 4, p-value = 0.3
## alternative hypothesis: true difference in means between group 4 and group 25 is not equal to 0
## 95 percent confidence interval:
##  -13.51   5.61
## sample estimates:
##  mean in group 4 mean in group 25 
##             5.00             8.95
#Based on the result below, we cannot reject the null hypothesis. 

4 Gapminder revisited

Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, you will join a few dataframes with more data than the ‘gapminder’ package. Specifically, you will look at data on

You must use the wbstats package to download data from the World Bank. The relevant World Bank indicators are SP.DYN.TFRT.IN, SE.PRM.NENR, NY.GDP.PCAP.KD, and SH.DYN.MORT

library(gapminder)
skim(gapminder)
Data summary
Name gapminder
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 142 Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent 0 1 FALSE 5 Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1.98e+03 1.73e+01 1952.0 1.97e+03 1.98e+03 1.99e+03 2.01e+03 ▇▅▅▅▇
lifeExp 0 1 5.95e+01 1.29e+01 23.6 4.82e+01 6.07e+01 7.08e+01 8.26e+01 ▁▆▇▇▇
pop 0 1 2.96e+07 1.06e+08 60011.0 2.79e+06 7.02e+06 1.96e+07 1.32e+09 ▇▁▁▁▁
gdpPercap 0 1 7.22e+03 9.86e+03 241.2 1.20e+03 3.53e+03 9.33e+03 1.14e+05 ▇▁▁▁▁
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))

# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")


library(wbstats)

worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
                          indicator = indicators, 
                          start_date = 1960, 
                          end_date = 2016)

# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels,  from the World Bank API 
countries <-  wbstats::wb_cachelist$countries

You have to join the 3 dataframes (life_expectancy, worldbank_data, and HIV) into one. You may need to tidy your data first and then perform join operations. Think about what type makes the most sense and explain why you chose it. * Left Join makes the most sense among the different types of join operations like outer joins - left, right, and full. This is because a left join operation (regardless of there being a match) preserves the original observations, especially when one looks up additional data from another table. Since, we are working with 3 dataframes, while mapping the year and date column with different start/end time frames, it is essential to preserve the original observations in each dataframe.

# tidying HIV data - hiv and life_expectancy dataframes - using pivot_longer() + remving NA values
hiv1 <- hiv %>%
  pivot_longer(2:34, names_to = "year", values_to = "Percentage_HIV_Cases_Age_15_49") %>% 
  drop_na()
skim(hiv1)
Data summary
Name hiv1
Number of rows 3301
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 24 0 154 0
year 0 1 4 4 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Percentage_HIV_Cases_Age_15_49 0 1 1.74 4.09 0.01 0.1 0.3 1.2 26.5 ▇▁▁▁▁
life_expectancy_1 <- life_expectancy %>%
  pivot_longer(2:302, names_to = "year", values_to = "Life_Expectancy") %>% 
  drop_na()
skim(life_expectancy_1)
Data summary
Name life_expectancy_1
Number of rows 55528
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 30 0 187 0
year 0 1 4 4 0 301 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Life_Expectancy 0 1 53 21.7 1.01 32.3 48.7 74.2 94.8 ▁▇▂▅▅
#Removing NA values in worldbank_data
worldbank_data_1 <- worldbank_data %>%
  drop_na()
skim(worldbank_data_1)
Data summary
Name worldbank_data_1
Number of rows 3715
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
iso2c 0 1 2 2 0 177 0
iso3c 0 1 3 3 0 177 0
country 0 1 4 30 0 177 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
date 0 1 1998.57 12.65 1970.00 1989.00 2001.0 2009.00 2.02e+03 ▂▃▃▆▇
NY.GDP.PCAP.KD 0 1 12893.37 18110.77 164.46 1514.19 4313.9 17178.37 1.16e+05 ▇▁▁▁▁
SE.PRM.NENR 0 1 85.08 17.09 10.05 81.05 91.7 96.43 1.00e+02 ▁▁▁▂▇
SH.DYN.MORT 0 1 52.96 59.80 2.20 10.90 26.7 76.35 3.70e+02 ▇▂▁▁▁
SP.DYN.TFRT.IN 0 1 3.35 1.82 1.08 1.83 2.7 4.82 8.45e+00 ▇▃▂▂▁
#Left Join of life_expectancy_1 (tidied life_expectancy) and hiv1 (tidied hiv) dataframes 
join_1 <- left_join(life_expectancy_1, hiv1, by = c("country"="country", "year"="year"))
# Converting datatype of column (year) from character type to numeric type
join_1$year = as.numeric(join_1$year)
#Left Join of join_1 (left join of hiv1 and life_expectancy_1 dataframes) and worldbank_data dataframes
join_2 <- left_join(join_1, worldbank_data_1, by = c("country"="country", "year"="date"))
#Left Join of join_1 (left join of hiv1 and life_expectancy_1 dataframes) and worldbank_data dataframes
join_3 <- left_join(join_2, countries, "country"="country") %>%
  drop_na()
join_3
## # A tibble: 1,083 × 25
##    country  year Life_Expectancy Percentage_HIV_Case… iso2c iso3c NY.GDP.PCAP.KD
##    <chr>   <dbl>           <dbl>                <dbl> <chr> <chr>          <dbl>
##  1 Algeria  1990            71.7                 0.06 DZ    DZA            3572.
##  2 Algeria  1991            72.2                 0.06 DZ    DZA            3444.
##  3 Algeria  1992            72.5                 0.06 DZ    DZA            3424.
##  4 Algeria  1993            72.7                 0.06 DZ    DZA            3279.
##  5 Algeria  1994            72.8                 0.06 DZ    DZA            3183.
##  6 Algeria  1995            72.9                 0.06 DZ    DZA            3241.
##  7 Algeria  1996            73.3                 0.06 DZ    DZA            3315.
##  8 Algeria  1997            73.2                 0.06 DZ    DZA            3298.
##  9 Algeria  1999            73.9                 0.06 DZ    DZA            3474.
## 10 Algeria  2000            74                   0.06 DZ    DZA            3558.
## # … with 1,073 more rows, and 18 more variables: SE.PRM.NENR <dbl>,
## #   SH.DYN.MORT <dbl>, SP.DYN.TFRT.IN <dbl>, capital_city <chr>,
## #   longitude <dbl>, latitude <dbl>, region_iso3c <chr>, region_iso2c <chr>,
## #   region <chr>, admin_region_iso3c <chr>, admin_region_iso2c <chr>,
## #   admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## #   income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## #   lending_type <chr>
  1. What is the relationship between HIV prevalence and life expectancy? Generate a scatterplot with a smoothing line to report your results. You may find faceting useful
  • Based on the scatterplot with a smoothing line, it seems like region plays a role in the relationship between HIV prevalence and life expectancy. HIV Prevalence is depicted through the percentage of HIV cases in the age group 15-49 years. In the case of Sub-Saharan Africa, there is a definite pattern of a negative slope which implies that as HIV prevalence increases, there is a decrease in Life Expectancy. In Latin America too, there is a negative slope implying an inverse relationship between HIV prevalence and Life Expectancy. However, there is more variability in the plot for Sub-Saharan Africa as compared to Latin America.
#Scatterplot for Life Expectancy vs. HIV prevalence 
ggplot(join_3, aes(x = Percentage_HIV_Cases_Age_15_49, y = Life_Expectancy)) +
    geom_point(size = 0.3) +
    geom_smooth(method="lm") +
    facet_wrap(~ region) +
    labs(title = "Relationship between HIV Prevalence and Life Expectancy", 
         x = "HIV Prevalence",
         y = "Life Expectancy")

  1. What is the relationship between fertility rate and GDP per capita? Generate a scatterplot with a smoothing line to report your results. You may find facetting by region useful
  • Based on the scatterplot with a smoothing line, it seems like facetting by region solidifies the belief that all regions depict an inverse relationship between fertility rate and GDP per capita. In the case of Sub-Saharan Africa in particular, as the fertility rate increases, the GDP per capita decreases. The inverse relationship between the two variables demonstrates the connection between fertility choices and economic considerations. In general, developing or low-income countries tend to have higher levels of fertility than their developed counterparts for several reasons such as infant mortality, lack of access to contraceptives, generally lower levels of female education, etc.
ggplot(join_3, aes(x = SP.DYN.TFRT.IN, y = NY.GDP.PCAP.KD)) +
    geom_point(size = 0.5) +
    geom_smooth(method="lm") +
    facet_wrap(~ region) +
    labs(title = "Relationship between Fertility Rate and GDP per capita", 
         x = "Fertility Rate",
         y = "GDP per capita")+
  NULL

  1. Which regions have the most observations with missing HIV data? Generate a bar chart (geom_col()), in descending order. Region ‘Sub-Saharan Africa’ has the most observations with missing HIV data. This is followed by Europe & Central Asia with the 2nd most observations with missing HIV data.
#Tidying hiv dataframe
tidy_hiv <- hiv %>% 
  pivot_longer(cols=2:34, names_to="year", values_to = "Percentage_HIV_Cases_Age_15_49")
#Left joining tidy_hiv and countries dataframes
joined_hiv_countries <- left_join(tidy_hiv, countries, "country"= "country")
joined_hiv_countries
## # A tibble: 5,082 × 20
##    country     year  Percentage_HIV_… iso3c iso2c capital_city longitude latitude
##    <chr>       <chr>            <dbl> <chr> <chr> <chr>            <dbl>    <dbl>
##  1 Afghanistan 1979                NA AFG   AF    Kabul             69.2     34.5
##  2 Afghanistan 1980                NA AFG   AF    Kabul             69.2     34.5
##  3 Afghanistan 1981                NA AFG   AF    Kabul             69.2     34.5
##  4 Afghanistan 1982                NA AFG   AF    Kabul             69.2     34.5
##  5 Afghanistan 1983                NA AFG   AF    Kabul             69.2     34.5
##  6 Afghanistan 1984                NA AFG   AF    Kabul             69.2     34.5
##  7 Afghanistan 1985                NA AFG   AF    Kabul             69.2     34.5
##  8 Afghanistan 1986                NA AFG   AF    Kabul             69.2     34.5
##  9 Afghanistan 1987                NA AFG   AF    Kabul             69.2     34.5
## 10 Afghanistan 1988                NA AFG   AF    Kabul             69.2     34.5
## # … with 5,072 more rows, and 12 more variables: region_iso3c <chr>,
## #   region_iso2c <chr>, region <chr>, admin_region_iso3c <chr>,
## #   admin_region_iso2c <chr>, admin_region <chr>, income_level_iso3c <chr>,
## #   income_level_iso2c <chr>, income_level <chr>, lending_type_iso3c <chr>,
## #   lending_type_iso2c <chr>, lending_type <chr>
# Determining NA values in joined_hiv_countries dataframe
joined_hiv_countries %>%
  filter(!is.na(region)) %>%
  group_by(region) %>%
  summarise(missing_hiv_values=sum(is.na(Percentage_HIV_Cases_Age_15_49))) %>%
  mutate(
    region=fct_reorder(region,-missing_hiv_values)) %>%
# Plotting Bar Chart of Region Specific Missing HIV Data in Descending Order
  ggplot(aes(x=region,y=missing_hiv_values))+
  geom_col()+
  labs(title="Bar Chart of Region Specific Missing HIV Data in Descending Order",
    x= "Region",
    y= "Missing NA values in HIV data"
  )

1. How has mortality rate for under 5 changed by region? In each region, find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.

#Minimum Year and Maximum Year determination
mortality <- join_3 %>% 
  group_by(country) %>%
  summarise(minimum_year=min(year), maximum_year=max(year))
mortality
## # A tibble: 92 × 3
##    country    minimum_year maximum_year
##    <chr>             <dbl>        <dbl>
##  1 Algeria            1990         2008
##  2 Angola             1998         2011
##  3 Argentina          1991         2011
##  4 Armenia            2002         2011
##  5 Azerbaijan         1991         2011
##  6 Bangladesh         1990         2010
##  7 Belarus            1995         2011
##  8 Belize             1999         2011
##  9 Benin              1984         2011
## 10 Bhutan             1998         2011
## # … with 82 more rows
# Dataframe with original mortality rates
join_5 <- left_join(join_3, mortality, "country"="country") %>% #Left Joining join_3 and mortality dataframes
select(country, year, minimum_year, maximum_year, SH.DYN.MORT, region) %>% # Selecting required columns
mutate(
  original_mortality = ifelse(year == minimum_year, SH.DYN.MORT, 0)) %>% #Determining original mortality rates
select(!year) %>%
filter(!original_mortality == 0)%>%
select(!SH.DYN.MORT)

join_5
## # A tibble: 92 × 5
##    country    minimum_year maximum_year region                     original_mortal…
##    <chr>             <dbl>        <dbl> <chr>                                 <dbl>
##  1 Algeria            1990         2008 Middle East & North Africa             49.5
##  2 Angola             1998         2011 Sub-Saharan Africa                    214. 
##  3 Argentina          1991         2011 Latin America & Caribbean              28.1
##  4 Armenia            2002         2011 Europe & Central Asia                  27.8
##  5 Azerbaijan         1991         2011 Europe & Central Asia                  95.3
##  6 Bangladesh         1990         2010 South Asia                            144. 
##  7 Belarus            1995         2011 Europe & Central Asia                  15.6
##  8 Belize             1999         2011 Latin America & Caribbean              24.4
##  9 Benin              1984         2011 Sub-Saharan Africa                    199. 
## 10 Bhutan             1998         2011 South Asia                             86.1
## # … with 82 more rows
# Dataframe with final mortality rates
join_6 <- left_join(join_3, mortality, "country"="country") %>% #Left Joining join_3 and mortality dataframes
select(country, year, minimum_year, maximum_year, SH.DYN.MORT, region) %>% # Selecting required columns
mutate(
  final_mortality = ifelse(year == maximum_year, SH.DYN.MORT, 0)) %>%  #Determining final mortality rates
select(!year) %>%
filter(!final_mortality == 0)%>%
select(!SH.DYN.MORT)

join_6
## # A tibble: 92 × 5
##    country    minimum_year maximum_year region                     final_mortality
##    <chr>             <dbl>        <dbl> <chr>                                <dbl>
##  1 Algeria            1990         2008 Middle East & North Africa            29.5
##  2 Angola             1998         2011 Sub-Saharan Africa                   112. 
##  3 Argentina          1991         2011 Latin America & Caribbean             13.9
##  4 Armenia            2002         2011 Europe & Central Asia                 17.6
##  5 Azerbaijan         1991         2011 Europe & Central Asia                 35  
##  6 Bangladesh         1990         2010 South Asia                            48.7
##  7 Belarus            1995         2011 Europe & Central Asia                  5.1
##  8 Belize             1999         2011 Latin America & Caribbean             18.3
##  9 Benin              1984         2011 Sub-Saharan Africa                   109. 
## 10 Bhutan             1998         2011 South Asia                            39.9
## # … with 82 more rows
# Joining aforementioned join_5 and join_6 dataframes along with Improvement in Mortality Rates calculation
join_7 <- left_join(join_5, join_6, "country"="country") %>%  #Left Joining join_5 and join_6
  mutate(
    mortality_improvement = (final_mortality - original_mortality)/original_mortality) %>%
  arrange(desc(mortality_improvement))

join_7
## # A tibble: 92 × 7
##    country   minimum_year maximum_year region   original_mortal… final_mortality
##    <chr>            <dbl>        <dbl> <chr>               <dbl>           <dbl>
##  1 South Af…         1991         2005 Sub-Sah…             56              79.1
##  2 Gabon             1997         1997 Sub-Sah…             87.2            87.2
##  3 Haiti             1997         1997 Latin A…            115.            115. 
##  4 South Su…         2011         2011 Sub-Sah…            102.            102. 
##  5 Sudan             2011         2011 Sub-Sah…             73.4            73.4
##  6 Lesotho           1984         2011 Sub-Sah…            101.             96.1
##  7 Zimbabwe          1998         2003 Sub-Sah…             95.6            90.6
##  8 Sao Tome…         2009         2010 Sub-Sah…             47.9            45  
##  9 Uzbekist…         2007         2008 Europe …             41.3            38.5
## 10 Ethiopia          2009         2011 Sub-Sah…             86.7            77.6
## # … with 82 more rows, and 1 more variable: mortality_improvement <dbl>
# Determining Change in Mortality Rate for Under 5 in each region
mortality_improvement_by_region <- join_7 %>%
  group_by(region) %>%
  summarise(mean_mortality_improvement_rate = 100 * mean(mortality_improvement))

mortality_improvement_by_region
## # A tibble: 6 × 2
##   region                     mean_mortality_improvement_rate
##   <chr>                                                <dbl>
## 1 East Asia & Pacific                                  -40.0
## 2 Europe & Central Asia                                -48.1
## 3 Latin America & Caribbean                            -48.9
## 4 Middle East & North Africa                           -56.5
## 5 South Asia                                           -48.0
## 6 Sub-Saharan Africa                                   -32.0
# Plot for Region-Specific Change in Mortality Rate for Under 5
ggplot(mortality_improvement_by_region, aes(x = mean_mortality_improvement_rate, y = fct_reorder(region, -mean_mortality_improvement_rate))) +
  geom_col()+
  labs(
    title= "Region-Specific Change in Mortality Rate for Under 5",
    x= "Change in Mortality Rate for Under 5",
    y="Region"
  )

#Sub-Saharan Africa
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "Sub-Saharan Africa") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country      minimum_year maximum_year region original_mortal… final_mortality
##   <chr>               <dbl>        <dbl> <chr>             <dbl>           <dbl>
## 1 South Africa         1991         2005 Sub-S…             56              79.1
## 2 Gabon                1997         1997 Sub-S…             87.2            87.2
## 3 South Sudan          2011         2011 Sub-S…            102.            102. 
## 4 Sudan                2011         2011 Sub-S…             73.4            73.4
## 5 Lesotho              1984         2011 Sub-S…            101.             96.1
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "Sub-Saharan Africa") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country    minimum_year maximum_year region   original_mortal… final_mortality
##   <chr>             <dbl>        <dbl> <chr>               <dbl>           <dbl>
## 1 Senegal            1981         2011 Sub-Sah…             198.            63  
## 2 Niger              1990         2011 Sub-Sah…             329.           115. 
## 3 Eritrea            1992         2011 Sub-Sah…             138.            53.2
## 4 Mozambique         1990         2011 Sub-Sah…             243.           100. 
## 5 Tanzania           1990         2010 Sub-Sah…             165.            71.9
## # … with 1 more variable: mortality_improvement <dbl>
#South Asia
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "South Asia") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country   minimum_year maximum_year region     original_mortal… final_mortality
##   <chr>            <dbl>        <dbl> <chr>                 <dbl>           <dbl>
## 1 Pakistan          2002         2011 South Asia            102.             85  
## 2 Sri Lanka         2001         2011 South Asia             16              11.2
## 3 Nepal             1999         2011 South Asia             85.8            44.2
## 4 India             1990         2009 South Asia            126.             61.4
## 5 Bhutan            1998         2011 South Asia             86.1            39.9
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "South Asia") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country    minimum_year maximum_year region     original_mortal… final_mortality
##   <chr>             <dbl>        <dbl> <chr>                 <dbl>           <dbl>
## 1 Maldives           1997         2008 South Asia             52.8            16  
## 2 Bangladesh         1990         2010 South Asia            144.             48.7
## 3 Bhutan             1998         2011 South Asia             86.1            39.9
## 4 India              1990         2009 South Asia            126.             61.4
## 5 Nepal              1999         2011 South Asia             85.8            44.2
## # … with 1 more variable: mortality_improvement <dbl>
#Latin America & Caribbean
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "Latin America & Caribbean") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country            minimum_year maximum_year region original_mortal… final_mortality
##   <chr>                     <dbl>        <dbl> <chr>             <dbl>           <dbl>
## 1 Haiti                      1997         1997 Latin…            115.            115. 
## 2 Suriname                   2005         2011 Latin…             26.5            22.5
## 3 Dominican Republic         1999         2011 Latin…             42              33.7
## 4 Belize                     1999         2011 Latin…             24.4            18.3
## 5 Costa Rica                 1990         2011 Latin…             16.9            10.2
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "Latin America & Caribbean") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country   minimum_year maximum_year region    original_mortal… final_mortality
##   <chr>            <dbl>        <dbl> <chr>                <dbl>           <dbl>
## 1 Ecuador           1979         2011 Latin Am…             96.1            17.5
## 2 Honduras          1979         2008 Latin Am…             99.3            25.5
## 3 Peru              1993         2011 Latin Am…             67.1            18.9
## 4 Nicaragua         1986         2010 Latin Am…             78.7            23.9
## 5 Colombia          1984         2011 Latin Am…             45.1            17.8
## # … with 1 more variable: mortality_improvement <dbl>
#Europe & Central Asia
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "Europe & Central Asia") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country    minimum_year maximum_year region   original_mortal… final_mortality
##   <chr>             <dbl>        <dbl> <chr>               <dbl>           <dbl>
## 1 Uzbekistan         2007         2008 Europe …             41.3            38.5
## 2 Serbia             2005         2011 Europe …              8.9             7.4
## 3 Ukraine            2002         2011 Europe …             16.7            11.2
## 4 Armenia            2002         2011 Europe …             27.8            17.6
## 5 Bulgaria           1996         2011 Europe …             19.2            10.4
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "Europe & Central Asia") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country    minimum_year maximum_year region   original_mortal… final_mortality
##   <chr>             <dbl>        <dbl> <chr>               <dbl>           <dbl>
## 1 Turkey             1990         2011 Europe …             73.9            17  
## 2 Georgia            1995         2011 Europe …             45.2            13  
## 3 Belarus            1995         2011 Europe …             15.6             5.1
## 4 Azerbaijan         1991         2011 Europe …             95.3            35  
## 5 Kazakhstan         2000         2011 Europe …             42.2            18.3
## # … with 1 more variable: mortality_improvement <dbl>
#Middle East & North Africa
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "Middle East & North Africa") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 3 × 7
##   country minimum_year maximum_year region      original_mortal… final_mortality
##   <chr>          <dbl>        <dbl> <chr>                  <dbl>           <dbl>
## 1 Algeria         1990         2008 Middle Eas…             49.5            29.5
## 2 Morocco         1990         2011 Middle Eas…             79.1            30.4
## 3 Tunisia         1990         2011 Middle Eas…             55.3            18  
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "Middle East & North Africa") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 3 × 7
##   country minimum_year maximum_year region      original_mortal… final_mortality
##   <chr>          <dbl>        <dbl> <chr>                  <dbl>           <dbl>
## 1 Tunisia         1990         2011 Middle Eas…             55.3            18  
## 2 Morocco         1990         2011 Middle Eas…             79.1            30.4
## 3 Algeria         1990         2008 Middle Eas…             49.5            29.5
## # … with 1 more variable: mortality_improvement <dbl>
#East Asia & Pacific
join_7 %>% #The Top 5 countries with the greatest improvement in mortality rates
  filter(region == "East Asia & Pacific") %>%
  slice_max(order_by = mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country     minimum_year maximum_year region  original_mortal… final_mortality
##   <chr>              <dbl>        <dbl> <chr>              <dbl>           <dbl>
## 1 Fiji                1992         2011 East A…             26.9            23.6
## 2 Thailand            2006         2009 East A…             16.4            14.2
## 3 Myanmar             2000         2010 East A…             89              63.4
## 4 Vietnam             1998         2011 East A…             32.8            22.6
## 5 Philippines         1992         2009 East A…             50.4            32.1
## # … with 1 more variable: mortality_improvement <dbl>
join_7 %>% #The Top 5 countries with the least improvement in mortality rates
  filter(region == "East Asia & Pacific") %>%
  slice_max(order_by = -mortality_improvement, n = 5)
## # A tibble: 5 × 7
##   country     minimum_year maximum_year region  original_mortal… final_mortality
##   <chr>              <dbl>        <dbl> <chr>              <dbl>           <dbl>
## 1 Mongolia            1995         2011 East A…             88.1            27.9
## 2 Cambodia            1997         2011 East A…            120.             40.6
## 3 Indonesia           1990         2011 East A…             84              32.6
## 4 Malaysia            1994         2011 East A…             13.9             8  
## 5 Philippines         1992         2009 East A…             50.4            32.1
## # … with 1 more variable: mortality_improvement <dbl>
  1. Is there a relationship between primary school enrollment and fertility rate? Yes, there seems to be an inverse relationship between primary school enrollment and fertility rate.
ggplot(join_3, aes(x = SE.PRM.NENR, y = SP.DYN.TFRT.IN)) +
    geom_point() +
    geom_smooth(method="lm") +
    labs(title = "Relationship between Primary School Enrollment and Fertility Rate", 
         x = "Primary School Enrollment",
         y = "Fertility Rate")

5 Challenge 1: Excess rentals in TfL bike sharing

Recall the TfL data on how many bikes were hired every single day. We can get the latest data by running the following

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-08-23T14%3A32%3A29/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20210912%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20210912T211134Z&X-Amz-Expires=300&X-Amz-Signature=213cc5c78a7babb89325fd0e572fabaedd5ec17b0ab3b39c1cccc719b6f71386&X-Amz-SignedHeaders=host]
##   Date: 2021-09-12 21:11
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 173 kB
## <ON DISK>  /var/folders/7g/r4r0bq0n6v7d8lh2hqs28_mm0000gn/T//RtmpOPecfM/file1641a572a6ba5.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.

Look at May and Jun and compare 2020 with the previous years. What’s happening?

  • According to the above graph, most of the months are of a right skewed distribution while May and June are close to a normal distribution. Being right skewed means that the average bike rental value is smaller than that of which distribution is normal. During summer, it is both predicted that bike rentals go up, and this explains the observation for May and June in this graph. It can also be seen that for the lower hire counts, their frequencies are very low. This indicates that bike rental was so popular that the very few low hire counts were observed.

  • For year 2020, the pandemic obviously had heavily struck the bike rental industry. Overall, the frequency of renting a bike has decreased significantly compared to the other years. However, it is interesting to find that for the lower hire counts in May and June, their frequency increased in 2020 with regard to previous years. As it gets less popular for people to go out and rent bikes, the lower hire counts would appear more frequently than before. In fact, all the plots for 2020 are more right skewed than they were in the other years, as the data set shifted to lower hire counts due to the pandemic.

However, the challenge I want you to work on is to reproduce the following two graphs.

The second one looks at percentage changes from the expected level of weekly rentals. The two grey shaded rectangles correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

Should you use the mean or the median to calculate your expected rentals? Why?

  • We will use the mean to calculate our expected rentals. As the distributions are mostly right-skewed, the median will be smaller than the mean and it will not be an accurate representation of the data group.

In creating your plots, you may find these links useful:

First, we take a glimpse of the dataset.

glimpse(bike)
## Rows: 4,020
## Columns: 5
## $ day         <dttm> 2010-07-30, 2010-07-31, 2010-08-01, 2010-08-02, 2010-08-0…
## $ bikes_hired <dbl> 6897, 5564, 4303, 6642, 7966, 7893, 8724, 9797, 6631, 7864…
## $ year        <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010…
## $ month       <ord> Jul, Jul, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug, Aug…
## $ week        <dbl> 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 32, 32, 32, 32, 32…

Then we look at the data if there’s missing values.

skim(bike)
Data summary
Name bike
Number of rows 4020
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 3
POSIXct 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
month 0 1 TRUE 12 Jul: 343, Jan: 341, Mar: 341, May: 341

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bikes_hired 0 1 26080.7 9673.5 2764 19138 25884 32986 73094 ▃▇▅▁▁
year 0 1 2015.6 3.2 2010 2013 2016 2018 2021 ▇▆▆▆▇
week 0 1 26.6 15.1 1 14 27 40 53 ▇▇▇▇▇

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
day 0 1 2010-07-30 2021-07-31 2016-01-29 12:00:00 4020
#We can see that there are no missing values. 
#Prepare the dataframe for the first graph (monthly bike rental)
bike$day = as.Date(bike$day)

#filter the data for 2016-2019
history_bike <- bike %>% 
  filter(between(day, as.Date("2016-01-01"), as.Date("2019-12-30")))

#calculate the overall monthly average between 2016-2019 as benchmark
expected_bike_pcm <- history_bike %>% 
  group_by(month) %>% 
  summarise(expected_rental = mean(bikes_hired))

#calculate the excess rentals and percentage for 2016-2021
#excess_rentals = actual_rentals - expected_rentals
actual_bike_pcm <- bike %>% 
  filter(between(day, as.Date("2016-01-01"), as.Date("2021-12-30"))) %>% 
  group_by(year,month) %>% 
  summarise(actual_rental = mean(bikes_hired)) %>% 
  left_join(expected_bike_pcm, by = "month") %>% 
  mutate(excess_rentals = actual_rental - expected_rental)
#Prepare the dataframe for the second graph (weekly bike rental)

#calculate the overall weekly average between 2016-2019 as benchmark
expected_bike_pw <- history_bike %>% 
  group_by(week) %>% 
  summarise(expected_rentals = mean(bikes_hired))

#calculate the excess rentals and percentage for 2016-2021
actual_bike_pw <- bike %>% 
  filter(between(day, as.Date("2016-01-01"), as.Date("2021-12-30"))) %>% 
  group_by(year,week) %>% 
  summarise(actual_rental = mean(bikes_hired)) %>% 
  left_join(expected_bike_pw, by = "week") %>% 
  mutate(excess_rentals = actual_rental - expected_rentals, 
         excess_rentals_pct = excess_rentals/expected_rentals) 
month_plot <- actual_bike_pcm %>% 
  ggplot(aes(x = month)) +
  
  #Plot the actual and expected lines
  geom_line(aes(y = actual_rental, group = 1), color = "black", size = 0.2)+
  geom_line(aes(y = expected_rental, group = 1), color = "blue", size = 0.6)+
  
  #Colourise the plots using ifelse function
  geom_ribbon(aes(group = 1, ymin = ifelse(actual_rental <= expected_rental, actual_rental, expected_rental),ymax = expected_rental), 
              fill = "palevioletred3", alpha = 0.4)+
  geom_ribbon(aes(group = 1, ymin = ifelse(actual_rental > expected_rental, expected_rental, actual_rental),ymax = actual_rental), 
              fill = "green", alpha = 0.4)+
  facet_wrap(~year)+
  theme_minimal()+
  theme(axis.text.x = element_text(size = 5), axis.title.y = element_text(size=9), plot.title = element_text(size=9),
        plot.subtitle = element_text(size=9),  plot.caption = element_text(size=5))+
  labs(title = "Monthly changes in TfL bike rentals", subtitle = "Change from monthly average shown in blue
and calculated between 2016-2019",y= "Bike rentals", x = "", caption = "Source: TfL, London Data Store")+
  NULL


month_plot

actual_bike_pw <- actual_bike_pw %>% 
  filter(!(year=="2021" & week > 27))

x_axis_color = ifelse(actual_bike_pw$excess_rentals_pct > 0 , "green", "red")

week_plot <- actual_bike_pw %>% 
  ggplot(aes(x = week)) +
  
  #Plot the excess rental line
  geom_line(aes(group = 1, y = excess_rentals_pct), color = "black", size = 0.3)+
  
  #Fill the area between the x-axis and the line
  geom_ribbon(aes(group = 1, ymin = ifelse(excess_rentals_pct >0, 0, excess_rentals_pct), ymax = excess_rentals_pct),
              fill = "green", alpha = 0.3)+
  geom_ribbon(aes(group = 1, ymin = ifelse(excess_rentals_pct <=0, excess_rentals_pct, 0), ymax = 0),
              fill = "palevioletred3", alpha = 0.3)+
  
  #Plot the coloured ticks on x-axis
  geom_rug(color = ifelse(actual_bike_pw$excess_rentals_pct > 0 , "green", "palevioletred3"), alpha = 0.9, size = 0.3) +
  
  #Turn y axis into percentage scale and format the major ticks on x-axis according to the sample plot
  scale_y_continuous(labels = scales::percent, limits = c(-0.6,1.1))+
  scale_x_continuous(breaks = c(13,26,39,53), limits = c(0,53))+
  
  facet_wrap(~year)+
  theme_minimal()+
  
  #Plot the gray rectangles in the grid
  annotate("rect",xmin = 13, xmax = 26, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3) +
  annotate("rect",xmin = 39, xmax = 53, ymin = -Inf, ymax = Inf, fill = "grey", alpha = 0.3) +
  
  #Labels
  labs(title = "Weekly changes in TfL bike rentals", subtitle = "% change from weekly averages 
calculated between 2016-2019",
       y= "", x = "week", caption = "Source: TfL, London Data Store")+
  theme(axis.text.x = element_text(size = 10), axis.title.x = element_text(size=9), plot.title = element_text(size=9),
        plot.subtitle = element_text(size=9),  plot.caption = element_text(size=5))+
  NULL

week_plot

# Deliverables

As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.

6 Details

  • Who did you collaborate with: Linli Ding, Yifan Yang, Clemens Scherf, Chandrima Tolia, Edoardo Ferri, Hamish Thomas
  • Approximately how much time did you spend on this problem set: 2 days
  • What, if anything, gave you the most trouble: Replicating the plots

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else? * Yes